[WIP][BREAKING][worker, ckpt] support checkpoint engine for sync parameters in hybrid mode#4602
[WIP][BREAKING][worker, ckpt] support checkpoint engine for sync parameters in hybrid mode#4602kip-cxj wants to merge 4 commits intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a checkpoint_engine to handle weight synchronization in hybrid mode, with changes across FSDP and Megatron workers, as well as vLLM and SGLang rollouts. The implementation adds new configuration flags and logic to conditionally use this new engine.
My review has identified a few critical issues:
- A potential bug in
fsdp_workers.pywhere incorrect weights might be updated due to a variable mix-up. This is coupled with significant code duplication that should be refactored. - Unsafe access to environment variables in
vllm_rollout.pywhich could lead to worker crashes if the environment is not perfectly configured.
I've provided code suggestions to address these critical issues for improved correctness and robustness.
| rank = int(os.environ["RANK"]) | ||
| local_world_size = int(os.environ["RAY_LOCAL_WORLD_SIZE"]) |
There was a problem hiding this comment.
Accessing environment variables using os.environ["KEY"] is unsafe as it will raise a KeyError if the variable is not set, causing the worker to crash. It's much safer to use os.getenv("KEY", default_value).
This is a critical issue that can lead to runtime crashes if the environment is not perfectly configured.
| rank = int(os.environ["RANK"]) | |
| local_world_size = int(os.environ["RAY_LOCAL_WORLD_SIZE"]) | |
| rank = int(os.getenv("RANK", "0")) | |
| local_world_size = int(os.getenv("RAY_LOCAL_WORLD_SIZE", "1")) |


What does this PR do?
Implement the
checkpoint engineas a standalone update weights module for hybrid mode.Currently, the inference backend does not support vllm. Adaptation for the vLLM module requires separating actor and rollout into separate processes:#4280
Checklist Before Starting
1. [BREAKING][recipe, ckpt] feat: support parameter sync by checkpoint-engine. only for fully_async mode. #4427
2. [wip][BREAKING][recipe, ckpt]add checkpoint engine for one step off policy #4601
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)